Abstract

Riesenhuber & Poggio recently proposed a model of object recognition in cortex which, beyond
integrating general beliefs about the visual system in a quantitative framework, made testable
predictions about visual processing. In particular, they showed that invariant object representation
could be obtained with a selective pooling mechanism over properly chosen afferents through a MAX
operation. For instance, at the complex cell level, pooling over a group of simple cells at the same
preferred orientation and position in space but at slightly different spatial frequencies would
provide scale tolerance, while pooling over a group of simple cells at the same preferred orientation
and spatial frequency but at slightly different positions in space would provide position tolerance.
Indirect support for such mechanisms in the visual system comes from the ability of the architecture
at the top level to replicate shape tuning as well as shift and size invariance properties of
"view-tuned cells" (VTUs) found in inferotemporal cortex (IT), the highest area in the ventral visual
stream, thought to be crucial in mediating object recognition in cortex. There is also now good
physiological evidence that a MAX operation is performed at various levels along the ventral stream.
However, in the original paper by Riesenhuber & Poggio, tuning and pooling parameters of model units
in early and intermediate areas were only qualitatively inspired by physiological data. Many studies
have investigated the tuning properties of simple and complex cells in primary visual cortex, V1. We
show that units in the early levels of HMAX can be tuned to produce realistic simple and complex
cell-like tuning, and that the earlier findings on the invariance properties of model VTUs still hold
in this more realistic version of the model.
1  Introduction

Extending previous models of object recognition in cortex [1, 2], Riesenhuber & Poggio have shown
that invariant object representations (similar to the ones found in inferotemporal (IT) cortex)
could be explained by the combined action of two operations:

A weighted linear summation: units performing a weighted linear summation (followed by a Gaussian
nonlinearity) over afferents tuned to different features (equivalent to template matching) would be
well suited to explain the increase in complexity of the optimal stimulus driving cells en route to
object recognition.

A MAX operation: units performing a non-linear MAX operation over afferents tuned to slightly
distorted versions of the same feature (shifted and rescaled) should provide the substrate for
building increasingly invariant representations.
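
As a minimal sketch of these two operations (not the authors' implementation), the tuning operation can be written as a Gaussian function of the distance between the afferent activities and a stored template, and the invariance operation as a plain maximum. The function names and the default `sigma` are illustrative assumptions, not values taken from the text.

```python
import math

def tuned_response(afferents, template, sigma=1.0):
    """Template matching: Gaussian of the distance between afferents and a template.

    sigma is an illustrative tuning width (assumption).
    """
    sq_dist = sum((a - t) ** 2 for a, t in zip(afferents, template))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def max_response(afferents):
    """Invariance pooling: the unit's output equals that of its strongest afferent."""
    return max(afferents)

# Afferents tuned to shifted or rescaled versions of one feature:
# the pooled unit responds as soon as any one of them is driven.
pooled = max_response([0.1, 0.9, 0.3])

# A tuned unit fires maximally when its inputs match the template exactly.
matched = tuned_response([0.2, 0.7], template=[0.2, 0.7])
```

Alternating the two functions layer by layer gives the general shape of the hierarchy described in the Methods section.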

In a benchmark simulation [3], Riesenhuber & Poggio "recorded" from the HMAX model (see Fig. 1) and
showed that the range of invariances exhibited by the model VTUs (named after the view-tuned units
in IT) was compatible with shift, size and depth rotation tuning properties of view-tuned cells
[3, 4].

Additionally, biophysically plausible implementations of the MAX operation have been proposed [5],
and neurons performing a MAX operation have been found in area V4 in the primate [6] and, very
recently, also in complex cells in cat visual cortex [7]. The latter study showed that, consistent
with the model prediction, the response of complex cells elicited by the simultaneous presentation
of two bars (one optimal and one non-optimal) closely matches the response of the cells when
presented with the optimal stimulus alone.

In the original paper by Riesenhuber & Poggio, tuning and pooling parameters of model units in early
and intermediate areas were only qualitatively inspired by physiological data. In particular, many
studies have investigated the tuning properties of simple and complex cells in primary visual cortex
V1. We now take a detailed look at the compatibility of the model with population tuning at the
simple and complex cell level.

We start by improving the fit between model simple (S1) units (whose tuning properties in the
original model were chosen to just qualitatively resemble V1 simple cell shape) and the experimental
data. In particular, we show that a better account of the population spread of simple cell tuning
can be obtained with properly parameterized Gabor functions.

We further show that, starting with a representative distribution of simple cell tuning properties,
it is possible to adjust two of the main model parameters (spatial and frequency extent of the
afferent simple cells, see Fig. 1) such that the tuning properties of the corresponding set of
complex (C1) units are compatible with those of V1 complex cells. In particular, we find that the
increase in receptive field size [8] and spatial frequency bandwidth [9, 10] could be well accounted
for by the pooling mechanisms proposed in HMAX in order to gain size and shift tolerance at the C1
level.

As a benchmark for our model units, we consider tuning properties of parafoveal cells in monkey as
reported by two groups: De Valois et al. [9, 11] and Schiller et al. [10, 12, 13].* Focusing on this
new set of S1 and C1 cells, we use a benchmark paperclip recognition task as in [3, 4] and show that
the model is still able to replicate tuning properties of view-tuned cells in IT, suggesting that
the model is robust to changes in the low levels.


2  Methods 


2.1  Original  HMAX 

The precise architecture of HMAX has been described in detail elsewhere [3, 14-16], and we here only
highlight important features of the model (see Fig. 1). We first briefly describe the first two
layers of HMAX under study, that is, simple (S1) cells and complex (C1) cells. We then outline the
other two layers of the model (S2 and C2) to clarify how the VTUs are trained in the benchmark
recognition task (sections 2.3.5 and 3.3).

Simple (S1) cells. Input images (160 × 160 gray images corresponding to 4.4° of visual angle, see
[14]) are densely sampled by arrays of two-dimensional filters G_{x,y} (second derivative of
Gaussians) that can be expressed as:

$$G_{x,y} = \frac{(-x\cos\theta + y\sin\theta)^2}{\sigma^2(\sigma^2 - 1)} \times \exp\left(-\frac{(x\cos\theta + y\sin\theta)^2 + (-x\cos\theta + y\sin\theta)^2}{2\sigma^2}\right).$$

Table 1 details the values of the two filter parameters: orientation θ and width σ. The response of
the so-called S1 units, sensitive to bars of different orientations and thus roughly resembling
properties of simple cells in striate cortex, is given by centering filters of each size and
orientation at each pixel of the input image. The filters are sum-normalized to zero and
square-normalized to 1 so that S1 cell activity is between -1 and 1, modeling simple cells of phase
0 and π.

While non-biological (both in its implementation and because it neglects the response saturation of
V1 cells observed at high contrast [17-19]), this simplification is convenient and does not
interfere with our experiments, as we work with fixed contrast. Fig. 2 shows all simple (S1) cell
receptive field types used in standard HMAX.

*We considered parafoveal cells, as further studies of higher brain areas (V4, for instance) mostly
focused on parafoveal cell populations, although differences between the two groups are not always
significant: parafoveal cells tend to have slightly larger receptive fields, are slightly more
broadly tuned to spatial frequency, and tend to be tuned to lower spatial frequencies.

Complex (C1) cells. One prediction made by the model is that complex cells are phase invariant as
well as size and position tolerant. Fig. 1 describes how size and position invariance are increased
in the model. The mechanism relies on a non-linear MAX operation (or its soft-MAX approximation
[14]) over properly chosen afferents, i.e., a C1 unit's activity is determined by the strongest
input it receives.

For instance, pooling over simple (S1) cells at the same preferred orientation but responding to
bars of different lengths provides invariance with respect to changes in size (see Fig. 1B). The
amount of invariance gained is determined by the range of sizes (or, equivalently, spatial frequency
selectivities) over which the MAX is performed. We call these groups of S1 filters of a certain size
range filter bands. In standard HMAX, four filter bands are used, in which filter sizes are within
the ranges:

ScaleRange = {7 - 9; 11 - 15; 17 - 21; 23 - 29}  (1)

Similarly, position invariance is increased by pooling over S1 cells at the same preferred
orientation but whose receptive fields are centered on neighboring locations, i.e., within each
filter band, a pooling range is defined which determines the size of the array of neighboring S1
units of all sizes in that filter band which feed into a C1 unit (see Fig. 1A). It is important to
mention that only S1 filters with the same preferred orientation feed into a given C1 unit, to
preserve feature specificity. In standard HMAX, the pooling ranges for the four filter bands are:

PoolRange = {4; 6; 9; 12}  (2)

As a result, a C1 unit responds best to a bar of the same orientation as the S1 units that feed into
it, but already with an amount of spatial and size invariance that corresponds to the spatial and
filter size pooling ranges used for a C1 unit in the respective filter band. Additionally, C1 units
are invariant to contrast reversal, much as complex cells in striate cortex, by pooling over on and
off simple cells (before performing the MAX operation). Possible firing rates of a C1 unit thus
range from 0 to 1.
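
The pooling described above can be sketched as follows. This is a hedged illustration rather than the model's actual code: it assumes S1 responses are stored as plain 2-D lists (one map per filter size in the band, all at the same preferred orientation), and it uses the absolute value of the S1 response as a stand-in for pooling over on and off cells. The function name `c1_response` and its arguments are hypothetical.

```python
def c1_response(s1_maps, row, col, pool_range):
    """MAX over a pool_range x pool_range neighborhood and over all filter sizes in a band.

    s1_maps: list of 2-D lists of S1 responses in [-1, 1], one map per filter size,
             all for the same preferred orientation (assumed layout).
    (row, col): top-left corner of the pooling neighborhood.
    """
    best = 0.0
    for s1 in s1_maps:                      # pooling over scale (filter sizes in the band)
        last_row = min(row + pool_range, len(s1))
        last_col = min(col + pool_range, len(s1[0]))
        for r in range(row, last_row):      # pooling over position
            for c in range(col, last_col):
                # abs() merges on and off responses, giving contrast-reversal invariance
                best = max(best, abs(s1[r][c]))
    return best

# Two filter sizes from one band, responses on a tiny 2 x 2 grid.
s1_small = [[0.1, -0.8], [0.2, 0.3]]
s1_large = [[0.0, 0.5], [-0.4, 0.1]]
c1 = c1_response([s1_small, s1_large], row=0, col=0, pool_range=2)
```

Because the output is a maximum of absolute values, it stays in the range 0 to 1 whenever the S1 responses lie between -1 and 1.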

S2 cells. A square of four adjacent, non-overlapping C1 units belonging to the same filter band, in
a 2 × 2 arrangement, is grouped to provide input to each S2 unit. There are 256 different types of
S2 units in each filter band, corresponding to the 4^4 possible arrangements of four C1 units of
each of four types (i.e., preferred bar orientation). The S2 unit response function is a Gaussian
with mean 1 (i.e., {1, 1, 1, 1}) and standard deviation 1, i.e., an S2 unit has a maximal firing
rate of 1, which is attained if each of its four afferents fires at a rate of 1 as well. S2 units
provide the feature dictionary of HMAX, in this case all combinations of 2 × 2 arrangements of
"bars" (more precisely, C1 cells) at four possible orientations.
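
A short sketch of the S2 feature dictionary and response function, under the assumption that each S2 type is identified by a 4-tuple of preferred orientations (one per position in the 2 × 2 arrangement). The names `S2_TYPES` and `s2_response` are illustrative only.

```python
import itertools
import math

ORIENTATIONS = (0, 45, 90, 135)  # preferred bar orientations, in degrees

# One S2 type per assignment of an orientation to each of the four C1 afferents.
S2_TYPES = list(itertools.product(ORIENTATIONS, repeat=4))

def s2_response(c1_afferents, sigma=1.0):
    """Gaussian centered on (1, 1, 1, 1) with standard deviation sigma = 1."""
    sq_dist = sum((a - 1.0) ** 2 for a in c1_afferents)
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

n_types = len(S2_TYPES)                      # 4**4 arrangements
best = s2_response([1.0, 1.0, 1.0, 1.0])     # all four afferents firing at rate 1
```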

It is worth noting that those choices of S2 unit parameters remain somewhat arbitrary. This reflects
the lack of a precise characterization of the response properties of cells in intermediate layers of
visual cortex. Indeed, current work is trying to improve the fit between S2 units in HMAX and
biological neurons in V4 [20, 21]. We also showed in [22] that S2 unit centers could be learned in
order to perform robust real-world object recognition.

C2 cells. To finally achieve size invariance over all filter sizes in the four filter bands and
position invariance over the whole input image, the S2 units are again pooled by a MAX operation to
yield C2 units, the output units of the HMAX core system, designed to correspond to neurons in
extrastriate visual area V4 or posterior IT (PIT). There are 256 C2 units, each of which pools over
all S2 units of one type at all positions and scales. Consequently, a C2 unit will fire at the same
rate as the most active S2 unit that is selective for the same combination of four bars, but
regardless of its scale or position.

View-tuned units. C2 units in turn provide input to the view-tuned units (VTUs), named after their
property of responding well to a specific two-dimensional view of a three-dimensional object,
thereby closely resembling the view-tuned cells found in monkey inferotemporal cortex by Logothetis
et al. [4]. The C2 → VTU connections are so far the only stage of the HMAX model where learning
occurs (but see [22] for a method to learn S2 features with HMAX in the context of an object
detection task).

A VTU is tuned to a stimulus by selecting the activities of the N C2 units (all 256 or a subset) in
response to that stimulus as the center of an N-dimensional Gaussian response function, yielding a
maximal response of 1 for a VTU in case the C2 activation pattern exactly matches the C2 activation
pattern evoked by the training stimulus.†

2.1.1  New  HMAX 

S1 cells. We here motivate the use of Gabor functions to model simple cell receptive fields instead
of the Gaussian derivatives used in standard HMAX. For the past decade, Gabor filters have been
extensively used to model the receptive fields of simple cells. Gabor functions have been shown to
be solutions of an optimization problem that simultaneously minimizes uncertainty in both position
and spatial frequency [23] and to fit well with physiological data recorded from cat striate cortex
[24]. We use Gabor functions to model cortical simple cell receptive fields because they have more
free parameters and allow more accurate tuning than their homologue (Gaussian derivatives) used in
standard HMAX (see section 3 for a comparison between the two).

†We here consider the simplest way to train a set of VTUs from data, as in [3]. The method is
closely related to RBF networks, for which a function is approximated by a weighted sum of basis
functions centered on each data point (or a subset of the training data). More complex schemes
include a search for the VTU centers, as in generalized RBF networks for instance.

Figure 1: Left: Schematic of the model. Two types of computations, i.e., linear summation and
non-linear MAX operation, alternate between layers. Input images are first densely sampled by arrays
of two-dimensional filters at four different orientations, the simple (S1) units. Within a pooling
band, S1 cells (i.e., a group of cells at the same preferred orientation but at slightly different
scales and positions, see text) feed into complex (C1) cells through a MAX operation (see right
figure for illustration). In the next (S2) level, and within each filter band, a square of four
adjacent, non-overlapping C1 units in a 2 × 2 arrangement is grouped to provide input to an S2 unit.
To finally achieve size invariance over all filter sizes in the four filter bands and position
invariance over the whole input image, the S2 units are again pooled by a MAX operation to yield C2
units that again provide input to the view-tuned units (VTUs). Right: Schematic of how size and
shift tolerances are increased at the C1 level: a complex (C1) cell pools over S1 cells (within a
pooling band, see text) at the same orientation but A) centered at different locations, thus
providing some translation invariance, and B) at different scales, providing some scale invariance
to the complex cell.

Figure 2: Top: Model simple cell receptive fields used in standard HMAX [14]. Receptive field sizes
range from 0.19° to 0.8° at four different orientations. Bottom: Modeling simple cell receptive
fields with Gabor functions. Receptive field sizes range from 0.19° to 1.07° at four different
orientations. In order to obtain receptive field sizes within the bulk of the simple cell receptive
fields (0.1°-1°) reported in [8, 12], we cropped the Gabor receptive fields and applied a circular
mask so that, for a given parameter set (λ, σ), cell tuning properties are independent of their
orientations. Note that receptive fields were set on a gray background for display only and so that
relative sizes were preserved.

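Placing the origin of the x and y axis coordinates at the center of the receptive field, the filter
response is given by:

$$G_{x,y} = \exp\left(-\frac{(x\cos\theta + y\sin\theta)^2 + \gamma^2(-x\sin\theta + y\cos\theta)^2}{2\sigma^2}\right) \times \cos\left(2\pi\,\frac{1}{\lambda}\,(x\cos\theta + y\sin\theta) + \phi\right)$$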


The five parameters, i.e., orientation θ, aspect ratio γ, effective width σ, phase φ and wavelength
λ, determine the properties of the cells' spatial receptive fields. The tuning of simple cells in
cortex along these dimensions varies substantially. Rather than attempting to replicate the precise
distribution (which differs between the different studies), our aim is to show that model S1 unit
tuning can capture more robust statistics (such as sample mean or median) and the range of
experimental neurons.

As in standard HMAX, we considered four orientations only (θ = 0°, 45°, 90°, and 135°). This is an
oversimplification, but it has been shown to be sufficient to provide rotation and size invariance
at the VTU level in good agreement with recordings in IT [3]. φ was set to 0°, while different
phases are crudely approximated by centering receptive fields at all locations.

In order to obtain receptive field sizes consistent with values reported for parafoveal simple cells
[12], we increased the number of filter sizes relative to standard HMAX, leading to 17 filter sizes
from 7 × 7 (0.19° visual angle) to 39 × 39 (1.07° visual angle) obtained in steps of two pixels,
instead of the 12 filter sizes ranging between 7 × 7 (0.19° visual angle) and 29 × 29 (0.80° visual
angle) as in standard HMAX.

When fixing the values of the remaining three parameters (γ, λ and σ), we tried to account for
general cortical cell properties, that is: (i) Cortical cells' peak frequency selectivities are
negatively correlated with their receptive field sizes [10]. (ii) Cortical cells' spatial frequency
selectivity bandwidths are positively correlated with their receptive field sizes [10]. (iii)
Cortical cells' orientation bandwidths are positively correlated with their receptive field sizes
[13].

We empirically found that one way to account for all three properties was to include fewer cycles in
the units' receptive fields as their sizes (RF size) increase. We found that the two following (ad
hoc) formulas gave good agreement with the tuning properties of cortical cells:

$$\sigma = 0.0036 \times \text{RF size}^2 + 0.35 \times \text{RF size} + 0.18 \qquad (3)$$

$$\lambda = \frac{\sigma}{0.8} \qquad (4)$$

Table 1 gives the values of the parameters that determine Gabor filter tuning properties and how
they differ from those in standard HMAX (Gaussian derivatives).

For all cells with a given set of parameters (λ0, σ0) to share similar tuning properties at all
orientations, we applied a circular mask to the Gabor filters (see Fig. 2, bottom), which was not
done in standard HMAX. Cropping Gabor filters to a smaller size than their effective length and
width, we found that the aspect ratio γ had only a limited effect on the cells' tuning properties,
and it was fixed to 0.3 for all filters.
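
The Gabor parameterization described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: σ and λ follow the two ad hoc formulas given in the text (with the coefficients listed in Table 1), γ defaults to 0.3 and the phase to 0, and the circular mask is assumed to have a radius of half the filter size, which the text does not specify. No normalization is applied here.

```python
import math

def gabor_params(rf_size):
    """Effective width and wavelength (in pixels) as functions of RF size."""
    sigma = 0.0036 * rf_size ** 2 + 0.35 * rf_size + 0.18
    lam = sigma / 0.8
    return sigma, lam

def gabor_filter(rf_size, theta, gamma=0.3, phase=0.0):
    """rf_size x rf_size Gabor filter (list of rows) with a circular mask.

    theta is the preferred orientation in radians; the mask radius of
    rf_size / 2 is an assumption made for this sketch.
    """
    sigma, lam = gabor_params(rf_size)
    half = rf_size // 2
    radius = rf_size / 2.0
    rows = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            if math.hypot(x, y) > radius:
                row.append(0.0)            # outside the circular mask
                continue
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / lam + phase)
            row.append(envelope * carrier)
        rows.append(row)
    return rows

sigma7, lam7 = gabor_params(7)             # smallest filter size
g = gabor_filter(7, theta=0.0)
```

For the smallest filter (7 × 7) the formulas give σ ≈ 2.8 and λ ≈ 3.5 pixels, which matches the lower ends of the ranges listed in Table 1.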


C1 cells. In order to better account for complex cell tuning properties, we assigned new values to
the two parameters ScaleRange and PoolRange that control the filter bands in HMAX (see section 2.1).
The number of filter bands was increased from 4 to 8, while the number of filters within each filter
band was decreased (from 3 to 2 in each band), thus providing less scale tolerance (and therefore
narrower spatial frequency bandwidth) to complex cells. Values for the PoolRange variables varied
from 8 to 22, and new values were assigned to ScaleRange:

PoolRange = {8; 10; 12; 14; 16; 18; 20; 22}  (5)

ScaleRange = {7 - 9; 11 - 13; 15 - 17; 19 - 21; 23 - 25; 27 - 29; 31 - 33; 35 - 39}  (6)
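
As a quick consistency check, the new filter bands can be written down as a small configuration and expanded into individual filter sizes. The variable and function names are hypothetical; the numbers are those of equations (5) and (6), and sizes within a band are assumed to advance in steps of two pixels, as stated for the S1 filters.

```python
# (smallest, largest) filter size per band, from equation (6)
SCALE_BANDS = [(7, 9), (11, 13), (15, 17), (19, 21),
               (23, 25), (27, 29), (31, 33), (35, 39)]
# spatial pooling range per band, from equation (5)
POOL_RANGES = [8, 10, 12, 14, 16, 18, 20, 22]

def filter_sizes(band):
    """Filter sizes inside one band, in steps of two pixels."""
    smallest, largest = band
    return list(range(smallest, largest + 1, 2))

all_sizes = [size for band in SCALE_BANDS for size in filter_sizes(band)]
band_table = list(zip(SCALE_BANDS, POOL_RANGES))   # one (band, pool range) pair per band
```

Under this reading the eight bands cover all 17 filter sizes from 7 to 39; the last band (35 - 39) holds three sizes, while the other seven hold two each.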


Parameter                 standard HMAX               Gabor filters

RF size                   7 × 7 → 29 × 29             7 × 7 → 39 × 39
(receptive field size)    12 filters in steps of 2    17 filters in steps of 2

θ (orientation)           0, π/4, π/2, 3π/4           0, π/4, π/2, 3π/4

σ (effective width)       RF size/4                   a·RF size² + b·RF size + c
                                                      a = 0.0036; b = 0.35; c = 0.18
                          1.8-7.3                     2.8-19.5

γ (aspect ratio)          1                           0.3

λ (wavelength)            N/A                         σ/0.8
                          N/A                         3.5-24.4

Table 1: Comparison between parameters used in standard HMAX to model simple (S1) cells with
Gaussian derivatives and the ones used to model simple (S1) cells with Gabor filters to better
account for properties of parafoveal simple cells.
Figure 3: Top: Filters (Gabor (left) and Gaussian derivatives (right)) and preferred bar stimulus
superimposed. Bottom: Corresponding orientation tuning curves obtained with optimal bars, gratings
and edges. The three stimuli produced similar curves with Gabor filters but not with Gaussian
derivatives, as simple (S1) units tend to select shorter and wider bars.


2.2  Assessing  model  unit  tuning  properties 

2.2.1  Orientation  tuning 

Orientation tuning was assessed in two ways: First, following [11], we swept sine wave gratings of
optimal frequency over the receptive field of a model unit at thirty-six different orientations
(spanning 180° of the visual field in steps of 5°). For each cell tested, the maximum response
elicited for each orientation was recorded to fit a tuning curve, and the orientation bandwidth at
half-amplitude was calculated. For comparison with [13], we also swept edges and bars of optimal
dimensions: for each cell, the orientation bandwidth at 71% of the maximal response was calculated
as in [13].
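
The bandwidth measurement can be illustrated with a small helper that finds where a sampled tuning curve crosses a given fraction of its peak (0.5 for half-amplitude, 0.71 for the criterion of [13]). This is a sketch under simplifying assumptions: the curve is single-peaked, the peak lies away from the ends of the sampled range (no wrap-around of orientation is handled), and crossings are located by linear interpolation rather than by the curve fit mentioned in the text. The function name is hypothetical.

```python
def bandwidth_at_level(xs, responses, level=0.5):
    """Full width of a single-peaked tuning curve at `level` times its peak.

    xs: sampled stimulus values (e.g., orientations in degrees), increasing.
    responses: maximal response recorded at each sampled value.
    Returns None if the curve never drops below the criterion on one side.
    """
    peak = max(responses)
    i_peak = responses.index(peak)
    threshold = level * peak

    def crossing(indices):
        previous = i_peak
        for i in indices:
            if responses[i] < threshold:
                # interpolate between the last sample above and the first below
                frac = (responses[previous] - threshold) / (responses[previous] - responses[i])
                return xs[previous] + frac * (xs[i] - xs[previous])
            previous = i
        return None

    left = crossing(range(i_peak - 1, -1, -1))
    right = crossing(range(i_peak + 1, len(xs)))
    if left is None or right is None:
        return None
    return right - left

# Toy triangular tuning curve peaking at 20 degrees.
orientations = [0, 10, 20, 30, 40]
curve = [0.0, 0.5, 1.0, 0.5, 0.0]
half_amplitude_bw = bandwidth_at_level(orientations, curve, level=0.5)
criterion_71_bw = bandwidth_at_level(orientations, curve, level=0.71)
```

The same tuning curve yields a narrower bandwidth at the 71% criterion than at half-amplitude, which is why the two measures have to be compared with their respective data sets.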

Sweeping edges, bars and gratings gave similar tuning curves for Gabor filters, suggesting that if
simple cells can be well modeled by Gabor filters, measurements made by groups with different
stimuli (bars, gratings and edges) are indeed consistent. Bar stimuli with Gaussian derivatives as
in standard HMAX, however, led to tuning curves inconsistent with those obtained with edges and
gratings, indicating that Gaussian derivatives are a poor model of simple cell processing.

2.2.2  Spatial  frequency  tuning 

Spatial frequency selectivity was assessed by sweeping sine wave gratings of various spatial
frequencies over a model unit's receptive field. For each grating frequency, the maximal cell
response was recorded to fit a tuning curve, and the spatial frequency selectivity bandwidth was
calculated as in [9] by dividing the frequency score at the high crossover of the curve at
half-amplitude by the low crossover at the same level.

Taking the log2 of this ratio gives the bandwidth value (in octaves):

$$\text{bandwidth} = \log_2\left(\frac{\text{high cut}}{\text{low cut}}\right) \qquad (7)$$

For comparison with [10], we also calculated the selectivity index as defined in [10], by dividing
the frequency score at the low crossover of the curve at 71% of the maximal amplitude by the high
crossover at the same level and multiplying this value by 100 (a value of 50 representing a
specificity of 1 octave):

$$\text{selectivity index} = \frac{\text{low cut}}{\text{high cut}} \times 100 \qquad (8)$$
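
Both measures reduce to simple functions of the two crossover frequencies. The sketch below assumes that the selectivity index is the ratio of the low to the high crossover times 100, which is the reading under which a value of 50 corresponds to 1 octave, as stated in the text; the function names are illustrative.

```python
import math

def bandwidth_octaves(low_cut, high_cut):
    """Spatial frequency bandwidth in octaves (crossovers at half-amplitude)."""
    return math.log2(high_cut / low_cut)

def selectivity_index(low_cut, high_cut):
    """Selectivity index (crossovers at 71% of the maximal amplitude).

    Assumed form: 100 * low / high, so that 50 corresponds to 1 octave.
    """
    return 100.0 * low_cut / high_cut

# A unit whose crossovers sit at 2 and 4 cycles/degree spans exactly one octave.
one_octave = bandwidth_octaves(2.0, 4.0)
index_at_one_octave = selectivity_index(2.0, 4.0)
```

With this definition, narrower tuning gives a higher index, so the two measures move in opposite directions.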

2.3  Benchmark  paperclip  recognition  task 

2.3.1  Stimuli 

To test translation, size and rotation invariance properties of the VTUs, we used 80 out of a set
of 200 "paperclip" stimuli (20 targets, 60 distracters) similar to those used previously in [3, 4].
Examples of paperclip stimuli are shown in Fig. 4. The background pixel value was always set to zero
(contrast 100%), as in [3, 4].

2.3.2  Shift 

To examine shift invariance, we trained VTUs to each of the 20 target paperclips at size 64 × 64
pixels, positioned at the center of the 160 × 160 pixel input image. We then calculated C2 and VTU
responses for all paperclips at eight random positions around the reference position. An example of
tested positions for one paperclip (positions varied from one paperclip to another) is shown in
Fig. 4a.

2.3.3  Scaling 

To examine size invariance, we trained VTUs to each of the 20 target paperclips at size 64 × 64
pixels, positioned at the center of the 160 × 160 pixel input image. We then calculated C2 and VTU
responses for all paperclips at different sizes, in quarter-octave steps (i.e., squares with edge
lengths of 27, 32, 38, 45, 54, 64, 76, 91, 108, 129 and 154 pixels), again positioned at the center
of the 160 × 160 input image. Examples of three paperclips rescaled by ±1 octave from reference
(center) are shown in Fig. 4b.

2.3.4  Rotation 

To examine invariance to rotation in depth, we trained VTUs to each of the 20 target paperclips at
0° rotation and size 64 × 64 pixels, positioned at the center of the input image (160 × 160). We
then calculated C2 and VTU responses for all paperclips at different rotations from the origin
(±50° in steps of 4°). Examples of three paperclips at -20°, 0° and +20° are shown in Fig. 4c.
Figure 4: Stimulus transformations to test a) shift invariance (all tested positions), b) scaling
invariance (each row shows a reference paperclip rescaled by ±1 octave, left and right) and c) 3-D
rotation invariance (reference paperclip rotated by ±20°, left and right). For all invariance
tests, the reference is the 64 × 64 pixel center paperclip.


2.3.5  Task 

To assess the degree of invariance to stimulus transformations, we used a paradigm similar to the
one used in [3, 4], in which a transformed (rescaled or rotated in depth) target stimulus is
considered recognized in a certain presentation condition if the VTU tuned to the original target
(default size and view) responds more strongly to its presentation than to the presentation of any
distracter stimulus. This measures the hit rate at zero false positives.
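
The recognition criterion can be sketched as follows, assuming a VTU is a Gaussian centered on the C2 activation pattern of its training stimulus. The tuning width `sigma`, the toy two-dimensional C2 vectors and all function names are assumptions made for illustration; the model itself uses up to 256 C2 units.

```python
import math

def vtu_response(c2, center, sigma=1.0):
    """Gaussian VTU: response 1 when the C2 pattern equals the stored center."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(c2, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def is_recognized(center, target_c2, distracter_c2s, sigma=1.0):
    """True if the transformed target drives the VTU more than every distracter."""
    target = vtu_response(target_c2, center, sigma)
    return all(target > vtu_response(d, center, sigma) for d in distracter_c2s)

def hit_rate(center, transformed_targets, distracter_c2s, sigma=1.0):
    """Fraction of transformed views recognized, i.e., hit rate at zero false positives."""
    hits = sum(is_recognized(center, t, distracter_c2s, sigma) for t in transformed_targets)
    return hits / len(transformed_targets)

# Toy example: one VTU, two transformed views of its target, one distracter.
center = [1.0, 0.0]                       # C2 pattern of the training view
views = [[0.9, 0.1], [0.0, 1.0]]          # mild and strong transformation
distracters = [[0.5, 0.5]]
rate = hit_rate(center, views, distracters)
```

Only the mildly transformed view beats the distracter here, so the hit rate is one half. Because the Gaussian is monotonic in distance, the outcome of the comparison does not depend on the value chosen for `sigma`.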

3  Results 

3.1  Original  HMAX 

3.1.1  Spatial  frequency  tuning 

S1 units. We found that simple cells in original HMAX were too broadly tuned to spatial frequency:
spatial frequency bandwidth measured at half-amplitude was about 1.7 octaves for all units. De
Valois et al. report a median value of 1.32 [9] for parafoveal simple cells, with most cells lying
around 1-1.5 octaves. We found a similar discrepancy between model units and cortical cells from
data collected by Schiller et al., who report spatial-frequency selectivity index values in the
range of 40-80 (HMAX cell index values vary between 34 and 41).

Because Gaussian derivatives only have one free parameter, we found it impossible to have them match
both the spatial frequency distribution and the bandwidth of simple cells. Setting σ so that the
spatial frequency selectivities of the two populations match [9] (1-5.6 cycles/degree for
parafoveal cells vs. 1.4-5.8 cycles/degree as in standard HMAX) led to overly broad spatial
frequency tuning profiles, while setting σ so that the spatial frequency bandwidths match led to
peak frequencies that were too high. This motivates the use of functions with more degrees of
freedom, such as Gabor functions.

C1 units. Similarly, we found that complex cells were too broadly tuned to spatial frequency, with a
median spatial frequency bandwidth measured at half-amplitude around 2.1 octaves (range: 2.0-2.2
octaves), which is high compared to a value of 1.6 for parafoveal Y cells reported in [9].
Similarly, the spatial frequency index was around 30 and therefore lay outside the bulk (30-70)
reported in [10].

3.1.2  Orientation  tuning 

S1 units. As in section 3.1.1 for spatial frequency, we found that Gaussian derivatives could not
account for simple cell orientation tuning properties. Measured at half-amplitude, we found an
orientation tuning bandwidth of 97° for all cells, while De Valois et al. report a median value of
34° (range 20°-90°). Even though the value reported is surprisingly low (parafoveal simple cells
would thus be more narrowly tuned than foveal simple and complex cells), the discrepancy is still
large when compared to data collected by Schiller et al., who report a bulk in the range 20°-50°
[13] (measured at 71% of the maximal response with edges and bars), whereas HMAX unit orientation
bandwidth calculated in this way was about 69°.

C1 units. Consistent with the fact that all model simple
cells share similar orientation tuning properties, and since
complex cells pool over simple cells at the same preferred
orientation, we found that HMAX C1 orientation tuning was
identical to that of S1 units (97° at half-amplitude and 69°
at 71% of the maximal amplitude).
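
The two bandwidth criteria used here (half-amplitude and 71% of the maximal response) can be related by a simple conversion if one assumes a Gaussian-shaped tuning curve. This is only an illustrative assumption, not the procedure used to obtain the values above, and the function name is hypothetical. For a Gaussian, the full width at a criterion level L is proportional to sqrt(ln(1/L)), so widths at two levels differ by a fixed factor.

```python
import math

def rescale_bandwidth(bandwidth, from_level=0.5, to_level=0.71):
    """Convert the full width of a Gaussian tuning curve between two criterion levels.

    Assumes r(theta) = exp(-theta**2 / (2 * s**2)), whose full width at level L
    is 2 * s * sqrt(2 * ln(1 / L)).
    """
    return bandwidth * math.sqrt(math.log(1.0 / to_level) / math.log(1.0 / from_level))

half_amplitude_bw = 97.0  # degrees, value reported above for standard HMAX units
bw_at_71_percent = rescale_bandwidth(half_amplitude_bw)
print(f"{half_amplitude_bw:.0f} deg at half-amplitude -> {bw_at_71_percent:.1f} deg at 71%")
```

Under this assumption the conversion factor is about 0.70, which is of the same order as the ratio between the 97° and 69° values quoted above.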

3.2  New  HMAX  with  Gabor  filter  sets 
3.2.1  Spatial  frequency  tuning 

S1 units. As described in section 2.1.1, Gabor filter peak
frequencies are parameterized by the inverse of their
wavelength, ν = 1/λ (i.e., λ is the wavelength of the
modulating sinusoid). We found that the values measured
experimentally by sweeping optimally oriented gratings were
indeed close to ν. As expected (see section 2.1.1), we also
found a positive correlation between receptive field size and
frequency bandwidth, as well as a negative correlation with
peak frequency selectivities, which is consistent with
recordings made in primate striate cortex [9, 10].
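
The measurement described above can be sketched as follows. This is a minimal illustration, not the authors' code: the Gabor form (isotropic Gaussian envelope times a cosine), the filter size, and the values of `wavelength` and `sigma` are all assumptions, and frequencies are expressed in cycles/pixel rather than cycles/degree. The response to each grating is taken at its best phase by combining a cosine and a sine grating.

```python
import numpy as np

def coordinates(size, theta):
    """Pixel coordinates along the axis at orientation `theta` for a size x size patch."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    return x * np.cos(theta) + y * np.sin(theta), x, y

def gabor(size, wavelength, sigma, theta=0.0):
    """Even-symmetric Gabor filter with zero mean and unit norm (assumed form)."""
    xr, x, y = coordinates(size, theta)
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2)) * np.cos(2.0 * np.pi * xr / wavelength)
    g -= g.mean()
    return g / np.linalg.norm(g)

def grating_response(filt, freq, theta=0.0):
    """Response amplitude to a grating of frequency `freq` at its best phase."""
    xr, _, _ = coordinates(filt.shape[0], theta)
    c = np.sum(filt * np.cos(2.0 * np.pi * freq * xr))
    s = np.sum(filt * np.sin(2.0 * np.pi * freq * xr))
    return np.hypot(c, s)

wavelength, sigma = 8.0, 4.0                      # hypothetical values, in pixels
filt = gabor(33, wavelength, sigma)
freqs = np.linspace(0.02, 0.4, 381)               # cycles/pixel
tuning = np.array([grating_response(filt, f) for f in freqs])
measured_peak = freqs[np.argmax(tuning)]
print(f"nominal 1/wavelength = {1.0 / wavelength:.3f}, measured peak = {measured_peak:.3f}")
```

In this sketch the measured peak falls close to the nominal value 1/wavelength because the envelope is wide enough relative to the wavelength; much narrower envelopes would shift the peak.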
Figure 5: Coverage of the spatial frequency plane by Gabor functions (a) and Gaussian derivatives as in standard
HMAX (b). The lengths of the ellipses along the 4 axes of orientation indicate the filter frequency bandwidth,
and their widths the filter orientation bandwidth (both measured at half-amplitude). The new S1 cells are
more tightly tuned for both orientation and frequency but cover a wider range of spatial frequencies.


Model units' peak frequencies were in the range 1.6-9.8
cycles/degree (mean and median values of 3.7 and 2.8
cycles/degree respectively). This provides a reasonable fit
with cortical simple cells, whose peak frequencies lie
between values as extreme as 0.5 and 8.0 cycles/degree but
with a bulk around 1.0-4.0 cycles/degree (mean value of 2.2
cycles/degree) [9]. Indeed, peak frequencies as low as 0.5
cycles/degree are difficult to accommodate: using our
formula (3) to parameterize Gabor filters (see section
2.1.1), a cell with such a peak frequency would have a
receptive field size of about 2°, which is very large
compared to values reported in [8, 12] for simple cells.

Spatial frequency bandwidths measured at half-amplitude were
all in the range 1.1-1.8 octaves, which corresponds to a
subset of the range exhibited by cortical simple cells
(reported values extend from as low as 0.4 octaves to above
2.6 octaves). For the sake of simplicity, we tried to capture
the range of "bulk frequency bandwidths" (1-1.5 octaves for
parafoveal cells) and focused on population median values
(1.45 octaves for both cortical [9] and model cells). For
comparison with Schiller et al., we measured the spatial
frequency index and found values in the range 44-58 (median
55), which lie right in the bulk (40-70) reported in [10].

C1 units. Peak frequencies ranged from 1.8 to 7.8
cycles/degree (mean and median values of 3.9 and 3.2
cycles/degree respectively) for our model complex cells.
In [9], peak frequencies range between values as extreme as
0.5 and 8 cycles/degree, with a bulk of cells lying between
2 and 5.6 cycles/degree (mean around 3.2).

We found spatial frequency bandwidths at half-amplitude in
the range 1.5-2.0 octaves. Parafoveal complex cells lie
between values as extreme as 0.4 octaves and values above
2.6 octaves. Again, we tried to capture the bulk frequency
bandwidths, ranging between 1.0 and 2.0 octaves, and matched
the median values for the populations of model and cortical
cells [9] (1.6 octaves for both). The spatial frequency
bandwidth at 71% of the maximal response was in the range
40-50 (median 48), which lies within the bulk (40-60)
reported in [10]. Fig. 6 shows the complex vs. simple cell
spatial frequency bandwidths.

3.2.2  Orientation  tuning 

S1 units. We found a median orientation bandwidth at
half-amplitude of 44° (range 38°-49°). In [11], a median
value of 34° is reported. As mentioned earlier, this value
seems surprising: it would imply that parafoveal cells are
more tightly tuned than their foveal homologues, both simple
(median value 42°) and complex (45°). When we instead
measured the bandwidth at 71% of the maximal response for
comparison with Schiller et al., the fit was better, with a
median value of 30° (range: 27°-33°) compared with a bulk of
cortical simple cells within 20°-70° [13].

C1 units. We found a median orientation bandwidth at
half-amplitude of 43°, which is in excellent agreement with
the 44° reported in [11]. The bulk of cells reported in [13]
is within 20°-90°, and our values at 71% of the maximal
response range between 27° and 33° (median 31°), therefore
placing our model units among the most narrowly tuned
subpopulation of cortical complex cells. As in both
experimental data sets, the orientation tuning bandwidth of
the model complex units is very similar to that of the simple
units.

3.2.3  Summary 

We found that model simple S1 cells in the original HMAX
were too broadly tuned to both orientation and spatial
frequency compared to cortical simple cells (see section
3.1). We motivated the use of Gabor filters for simple S1
cells and empirically determined a set of parameters so that
model simple cell tuning properties match those of cortical
simple cells (see section 3.2).

Figure 6: Complex cell spatial frequency bandwidth
vs. simple cell spatial frequency bandwidth. There
is an increase of about 20% from simple to complex
cell spatial frequency bandwidth, consistent with
parafoveal cortical cells [9, 10].

The new set of S1 cells differs from the S1 cells in standard
HMAX with respect to orientation bandwidth (median 46° vs.
97° in standard HMAX), peak frequency selectivity (1.6-9.8
cycles/degree vs. 1.4-5.8 cycles/degree for standard HMAX),
frequency selectivity bandwidth (median 1.47 vs. 1.7 octaves
in standard HMAX) and receptive field size (0.2°-1.1° vs.
0.2°-0.8° in standard HMAX). The new set of S1 cells is more
narrowly tuned to both spatial frequency and orientation,
spans a larger range of frequencies and receptive field sizes
(see Fig. 5) and matches parafoveal simple cell tuning
properties more closely.

It also appeared from our study that the pooling mechanisms
inferred in the model for building complex cells from simple
cells are indeed consistent with complex cell tuning
properties. A comparison between HMAX and parafoveal complex
cells showed that position invariance (parameterized by the
variable PoolRange, see section 2.1) is actually larger than
in standard HMAX, while scale invariance (parameterized by
the variable ScaleRange, see section 2.1) is actually smaller
than in standard HMAX.

The new set of C1 cells differs from the C1 cells in standard
HMAX in terms of orientation bandwidth (median 43° vs. 97°
in standard HMAX), peak frequency selectivity (1.8-7.8
cycles/degree vs. 1.6-5.6 cycles/degree for standard HMAX),
frequency selectivity bandwidth (median 1.6 vs. 2.1 octaves
in standard HMAX) and receptive field size (0.4°-1.7° vs.
0.3°-1.1° in standard HMAX). The new set of C1 cells is more
narrowly tuned to both spatial frequency and orientation,
spans a larger range of frequencies (see Fig. 5) and matches
parafoveal complex cell tuning properties more closely.


3.3  Performance  on  a  benchmark  recognition  task 

To investigate the impact of this new representation on the
VTUs' shape specificity as well as their invariance to shift,
size and rotation, we performed a benchmark recognition task
with paperclip stimuli (see section 2.3.5) similar to the one
used in [3, 4] and found that invariance properties were
maintained. This suggests that the architecture in HMAX is
robust to changes in the tuning properties of cells at the
entry level.

VTUs with the new sets of S1 and C1 units had a mean
invariance to rotation in depth of about 34° (for reference,
33° for standard HMAX on the same task and 29° reported by
[4] for IT cells). For size invariance, we found a bandwidth
of about 2.8 octaves (for reference, at least 2.4 octaves for
standard HMAX and about 2 octaves for IT cells [3]).
Translation invariance was maintained with respect to all
positions tested across the units' receptive field, compared
to distracters at the center of the receptive field.

We also quantified the effect of the complex (C1) units'
receptive field size (controlled by the variable PoolRange,
see section 2.1) on VTU scale invariance properties. We found
that larger complex cell receptive field sizes lead to larger
scale invariance at the VTU level (see Fig. 7).

4  Discussion 

4.1  Impact  of  the  new  S1  and  C1  cell  populations  on
HMAX  architecture

We proposed a new set of receptive field shapes and
parameters for cells in the S1 and C1 layers of HMAX.
Compared to standard HMAX, we increased position invariance
(parameterized by the variable PoolRange, see section 2.1)
in model C1 cells and decreased scale invariance
(parameterized by the variable ScaleRange).
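
How a larger pooling range translates into larger position tolerance can be illustrated with a one-dimensional toy example. This is a sketch under simplifying assumptions, not the HMAX implementation: the bar-detector template, the one-pixel stimulus and the function names are hypothetical, and `pool_range` only plays the role of the PoolRange variable. A C1-like unit takes the MAX over identical S1-like units at neighbouring positions, and we count the stimulus shifts for which its response stays at its peak value.

```python
import numpy as np

def s1_layer(signal, template):
    """Rectified responses of identical S1-like units centred at every position."""
    return np.abs(np.correlate(signal, template, mode="same"))

def c1_unit(s1, center, pool_range):
    """MAX over the S1 afferents within `pool_range` positions around `center`."""
    half = pool_range // 2
    return float(s1[max(0, center - half):center + half + 1].max())

def tolerated_shifts(pool_range, n=41, center=20, max_shift=8):
    """Count the stimulus shifts for which the C1 unit keeps its peak response."""
    template = np.array([-1.0, 2.0, -1.0])       # hypothetical 1D bar detector
    responses = []
    for shift in range(-max_shift, max_shift + 1):
        signal = np.zeros(n)
        signal[center + shift] = 1.0             # one-pixel bar at the shifted position
        responses.append(c1_unit(s1_layer(signal, template), center, pool_range))
    responses = np.array(responses)
    return int(np.sum(np.isclose(responses, responses.max())))

for pool_range in (5, 9):
    print(f"pool_range={pool_range}: peak response kept for "
          f"{tolerated_shifts(pool_range)} shifts")
```

In this toy setting the number of tolerated shifts equals the pooling range, which is the sense in which a larger pooling range gives a larger position invariance.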

We showed that invariance properties at the VTU level were
not substantially affected by these changes, indicating that
the model appears to be robust to changes in the lower levels
of the hierarchy. Thus there exist a number of different
pooling schemes between S1 and C2 cells that still account
for VTU invariance properties.

We showed that a mechanism in which cells pool over
afferents tuned to the same preferred features but at
slightly different positions and scales is well suited to
explain the increase in receptive field size and spatial
frequency bandwidth from simple to complex cells. As we
mentioned in section 2.1, the following S2 layer in the model
(equivalent to V4 in primate cortex) was only qualitatively
inspired by physiological data. Further studies should focus
on intermediate visual areas such as V4, in which it was
shown that the increases in receptive field size and spatial
frequency bandwidth are even more pronounced.
Figure 7: Effect of the variable PoolRange on VTU scale
invariance properties. Plotted is the mean VTU scale
invariance vs. the increase in receptive field size from
simple to complex cells when increased by steps of 2 and
4 between pooling bands.

Schein & Desimone showed that the median spatial frequency
bandwidth at the V4 level was about 2.2 octaves [25] (which
represents an increase of about 40% over the complex cell
population). It is not clear whether this remains consistent
with the four-layer architecture of HMAX, and further
investigation of C2 unit tuning properties should be
performed. It has also been shown in [25] that V4 contains
cells covering a wide range of tuning properties (from 0.5 to
>4.0 octaves in spatial frequency bandwidth). Although this
could be an artifact of their methods (probing cells with the
wrong stimuli), it is possible that direct pooling from C1 to
C2 should be added.

We have thus shown that the physiological data on simple and
complex cell receptive field size, spatial frequency and
orientation bandwidth are in good agreement with the model
hypothesis of complex cells performing a MAX pooling over
simple cell afferents, a key step in the model towards
invariant object recognition.
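
The broadening of spatial frequency tuning under MAX pooling over scales can be sketched numerically. The example below is an illustration under stated assumptions rather than the model itself: S1 tuning curves are taken to be Gaussian in log2(frequency), the two afferent peak frequencies are hypothetical, and only the 1.45-octave S1 bandwidth is taken from the text. The C1-like tuning curve is the pointwise MAX of its afferents' curves.

```python
import numpy as np

def log_gaussian_tuning(freqs, peak, bandwidth_octaves):
    """Tuning curve that is Gaussian in log2(frequency), with the given FWHM in octaves."""
    sigma = bandwidth_octaves / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    return np.exp(-0.5 * (np.log2(freqs / peak) / sigma) ** 2)

def octave_bandwidth(freqs, response, level=0.5):
    """Bandwidth in octaves at `level` of the peak (assumes a contiguous passband)."""
    response = response / response.max()
    passband = freqs[response >= level]
    return np.log2(passband[-1] / passband[0])

freqs = np.logspace(-1, 1.5, 20001)              # cycles/degree
s1_peaks = (2.8, 3.5)                            # hypothetical afferent peak frequencies
s1_bandwidth = 1.45                              # octaves, S1 median reported in the text
s1_curves = np.stack([log_gaussian_tuning(freqs, p, s1_bandwidth) for p in s1_peaks])
c1_curve = s1_curves.max(axis=0)                 # MAX pooling over the two scales

s1_measured = octave_bandwidth(freqs, s1_curves[0])
c1_measured = octave_bandwidth(freqs, c1_curve)
print(f"S1 bandwidth: {s1_measured:.2f} octaves, C1 bandwidth: {c1_measured:.2f} octaves")
```

In this idealized case the pooled bandwidth is the afferent bandwidth plus the separation between the afferents' peak frequencies in octaves, so the amount of broadening is set directly by how far apart in scale the pooled afferents are.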

Invariance, i.e., the ability to recognize a pattern under
various transformations, is one goal of object recognition,
another one being specificity, i.e., the ability to
discriminate between different patterns. The next challenge
is to understand how shape complexity is increased along the
ventral visual stream, from the Gabor-like preferred stimuli
in V1 to neurons tuned to complex real-world stimuli such as
faces and hands in IT.